
We use Eqs. 4.30 – 4.31 to jointly learn the DCP-NAS and rewrite the objective function in Eq. 4.26 as
\[
\mathcal{L}_{\mathrm{DCP\text{-}NAS}}\big(\tilde{f}^{P}(w,\alpha),\, \tilde{f}^{C}_{b}(\hat{w},\hat{\alpha},\beta)\big)
= G(\hat{w},\hat{\alpha},\beta) + \lambda D(\hat{\alpha},\alpha) + \mu L_{R}(\hat{w},\beta).
\tag{4.32}
\]

Then we optimize the binary architecture $\hat{\alpha}$, which inherits from the real-valued one, along the tangent direction of the real-valued model. Note that when we set $\lambda = 0$, Eq. 4.32 is equivalent to the objective of the original CP-NAS [304]. As revealed in [157], real-valued weights converge faster than binarized ones. Motivated by this observation, the tangent direction of the Parent supernet can be used to approximate the optimization direction of the more slowly converging Child supernet. In summary, in Eq. 4.31 we improve the optimization of the Child architecture based on the tangent direction of the Parent architecture, which allows the Child supernet to be trained more efficiently.
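As a concrete illustration of Eq. 4.32, the following PyTorch-style sketch assembles the composite objective from precomputed pieces. The function and argument names are ours, not the DCP-NAS implementation, and $D(\hat{\alpha},\alpha)$ is taken here as the squared distance between the Child and Parent tangent directions, which is consistent with the gradient form in Eq. 4.33.

```python
import torch

# Sketch of the composite objective in Eq. 4.32 (illustrative names, not the authors' API).
# `g_child` is the Child loss G(w_hat, alpha_hat, beta), `l_reg` is L_R(w_hat, beta), and
# D(alpha_hat, alpha) is modeled as the squared distance between the two tangent directions.
def dcp_nas_objective(g_child, child_tangent, parent_tangent, l_reg, lam, mu):
    d = (child_tangent - parent_tangent).pow(2).sum()   # D(alpha_hat, alpha)
    return g_child + lam * d + mu * l_reg
```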

Considering that the binary weights are learned through the KL divergence, we optimize our DCP-NAS as
\[
\begin{aligned}
\frac{\partial \mathcal{L}_{\mathrm{DCP\text{-}NAS}}\big(\tilde{f}^{P}(w,\alpha),\, \tilde{f}^{C}_{b}(\hat{w},\hat{\alpha},\beta)\big)}{\partial \hat{\alpha}}
&= \frac{\partial G(\hat{w},\hat{\alpha},\beta)}{\partial \hat{\alpha}} + \lambda \frac{\partial D(\hat{\alpha},\alpha)}{\partial \hat{\alpha}} \\
&= \frac{\partial G(\hat{w},\hat{\alpha},\beta)}{\partial \hat{\alpha}}
  + 2\lambda \left( \frac{\partial G(\hat{w},\hat{\alpha},\beta)}{\partial \hat{\alpha}} - \frac{\partial \tilde{f}^{P}(w,\alpha)}{\partial \alpha} \right)
    \frac{\partial^{2} G(\hat{w},\hat{\alpha},\beta)}{\partial \hat{\alpha}^{2}} \\
&= \frac{\partial G(\hat{w},\hat{\alpha},\beta)}{\partial \hat{\alpha}}
  + 2\lambda \left( \frac{\partial G(\hat{w},\hat{\alpha},\beta)}{\partial \hat{\alpha}} - \frac{\partial \tilde{f}^{P}(w,\alpha)}{\partial \alpha} \right) H_{G}(\hat{\alpha}),
\end{aligned}
\tag{4.33}
\]
\[
\frac{\partial \mathcal{L}_{\mathrm{DCP\text{-}NAS}}\big(\tilde{f}^{P}(w,\alpha),\, \tilde{f}^{C}_{b}(\hat{w},\hat{\alpha},\beta)\big)}{\partial \hat{w}}
= \frac{\partial G(\hat{w},\hat{\alpha},\beta)}{\partial b_{\hat{w}}} \, \frac{\partial b_{\hat{w}}}{\partial \hat{w}},
\tag{4.34}
\]
where
\[
\frac{\partial b_{\hat{w}}}{\partial \hat{w}} = \mathbf{1}_{|\hat{w}| \le 1},
\tag{4.35}
\]
$\lambda$ is a hyperparameter, and $H_{G}(\hat{\alpha}) = \frac{\partial^{2} G(\hat{w},\hat{\alpha},\beta)}{\partial \hat{\alpha}^{2}}$ denotes the Hessian matrix. The DCP-NAS process is outlined in Fig. 4.12.
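In practice, the second term of Eq. 4.33 can be evaluated without materializing $H_{G}(\hat{\alpha})$ by using an autograd Hessian-vector product, and the weight gradient in Eqs. 4.34 – 4.35 is the familiar straight-through estimator. The sketch below, with hypothetical helper names, illustrates both; it is not the authors' implementation.

```python
import torch

def dcp_grad_alpha(child_loss, alpha_hat, parent_tangent, lam):
    # Eq. 4.33: dG/d(alpha_hat) plus the tangent-discrepancy term.  The Hessian
    # H_G(alpha_hat) enters only through a Hessian-vector product computed by a
    # second backward pass, so the full matrix is never formed.
    g = torch.autograd.grad(child_loss, alpha_hat, create_graph=True)[0]
    diff = (g - parent_tangent).detach()      # dG/d(alpha_hat) - d f^P / d(alpha)
    hvp = torch.autograd.grad(g, alpha_hat, grad_outputs=diff)[0]
    return g.detach() + 2.0 * lam * hvp

def dcp_grad_w(grad_wrt_binary, w_hat):
    # Eqs. 4.34-4.35: chain rule through the binarization b_w, with the
    # straight-through estimator 1_{|w_hat| <= 1}.
    return grad_wrt_binary * (w_hat.abs() <= 1).to(grad_wrt_binary.dtype)
```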

We minimize the difference between the gradient (tangent direction) $\partial G(\hat{w},\hat{\alpha},\beta)/\partial\hat{\alpha}$ and the gradient (tangent direction) $\partial \tilde{f}^{P}(w,\alpha)/\partial\alpha$, so that the real-valued and binary architecture searches proceed in the same direction and generate a better 1-bit architecture. Note that $\alpha$ inherits from $\hat{\alpha}$ at the beginning of each real-valued NAS iteration, indicating that we only use the $w$ and $\alpha$ of the real-valued NAS to provide a heuristic optimization direction for the 1-bit NAS, rather than to search for a better architecture for real-valued networks. Since a better tangent direction $\partial \tilde{f}^{P}(w,\alpha)/\partial\alpha$ is obtained, DCP-NAS finds an $\hat{\alpha}$ that is better suited to binary networks. Note that $\hat{\alpha}$ differs from $\alpha$: it is not an architecture optimized for real-valued weights but one optimized for 1-bit weights.
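To make the inheritance and the role of the Parent tangent direction concrete, here is a toy, self-contained loop using stand-in quadratic objectives (our own illustration, not the authors' code): $\alpha$ is re-initialized from $\hat{\alpha}$ in each round, the Parent supplies only its tangent direction, and $\hat{\alpha}$ is updated along Eq. 4.33, with the Hessian factor dropped for brevity (see the Hessian-vector-product sketch above).

```python
import torch

def parent_obj(w, alpha):                   # stand-in for f^P(w, alpha)
    return ((w - alpha) ** 2).sum()

def child_obj(w_hat, alpha_hat):            # stand-in for G(w_hat, alpha_hat, beta)
    return ((torch.sign(w_hat) - alpha_hat) ** 2).sum()

alpha_hat = torch.randn(4, requires_grad=True)   # 1-bit (Child) architecture params
w, w_hat = torch.randn(4), torch.randn(4)        # Parent / Child weights (kept fixed here)
lam, lr = 0.1, 0.05

for step in range(10):
    # The real-valued architecture inherits from the 1-bit one at each Parent iteration.
    alpha = alpha_hat.detach().clone().requires_grad_(True)
    tangent_p = torch.autograd.grad(parent_obj(w, alpha), alpha)[0].detach()

    # Child update: first-order part of Eq. 4.33 (weight updates via Eq. 4.34 omitted).
    g = torch.autograd.grad(child_obj(w_hat, alpha_hat), alpha_hat)[0]
    with torch.no_grad():
        alpha_hat -= lr * (g + 2 * lam * (g - tangent_p))
```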

The expression above involves an expensive second-order (Hessian) computation in its second term. Thus, in Section 4.4.5 we introduce a first-order approximation of the Hessian matrix to improve search efficiency.

4.4.5 Generalized Gauss-Newton Matrix (GGN) for Hessian Matrix

Since the Hessian matrix is computationally expensive, this section accelerates its calculation by deriving a second-order expansion based on Eq. 4.34.
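For reference, the standard GGN construction keeps only the part of the Hessian that involves first derivatives of the network: for outputs $z = f(\theta)$ and a scalar loss $\ell(z)$, it approximates $\nabla^{2}_{\theta}(\ell \circ f) \approx J^{\top} (\nabla^{2}_{z}\ell)\, J$ with $J = \partial z / \partial \theta$. The toy sketch below shows this generic construction only; whether it coincides with the exact approximation derived in the rest of this section is not claimed here.

```python
import torch

# Generic GGN sketch (a standard construction, not necessarily the exact form derived
# in this section): for z = f(theta) and scalar loss l(z), the exact Hessian is
#   H = J^T (d2l/dz2) J + sum_k (dl/dz_k) * d2 z_k / d theta2,
# and the GGN keeps only the first term, avoiding second derivatives of f.
theta = torch.randn(3)

def f(t):                                   # stand-in "network": outputs z(theta)
    return torch.stack([t.sum(), (t ** 2).sum()])

def l(z):                                   # stand-in loss l(z)
    return (z ** 2).sum()

z = f(theta)
J = torch.autograd.functional.jacobian(f, theta)        # dz/dtheta, shape (2, 3)
Hl = torch.autograd.functional.hessian(l, z.detach())   # d2l/dz2,   shape (2, 2)
ggn = J.T @ Hl @ J                                       # GGN approximation, shape (3, 3)
```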